Text Categorization using the Semi-Supervised Fuzzy c-Means Algorithm

Authors

  • Mohammed Benkhalifa
  • Amine Bensaid
  • Abdelhak Mouradi
Abstract

Text Categorization (TC) is the automated assignment of text documents to predefined categories based on document contents. Over the past few years, TC has become very important, particularly in the Information Retrieval area, where information needs have increased tremendously with the rapid growth of textual information sources such as the Internet. In this paper, we compare, for text categorization, two partially supervised (or semi-supervised) clustering algorithms: the Semi-Supervised Agglomerative Hierarchical Clustering (ssAHC) algorithm [1] and the Semi-Supervised Fuzzy c-Means (ssFCM) algorithm [2]. This semi-supervised learning paradigm falls somewhere between the fully supervised and the fully unsupervised learning schemes, in the sense that it exploits both class information contained in labeled data (training documents) and structure information possessed by unlabeled data (test documents) in order to produce better partitions for test documents. Our experiments make use of the Reuters-21578 database of documents and consist of a binary classification for each of the ten most populous categories of the Reuters database. To convert the documents into vector form, we experiment with different numbers of features, which we select based on an information gain criterion. We verify experimentally that ssFCM both outperforms and takes less time than the Fuzzy c-Means (FCM) algorithm. With a smaller number of features, ssFCM's performance is also superior to that of ssAHC [3]. Finally, ssFCM results in improved performance and faster execution time as more weight is given to training documents.
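
The abstract states that features are selected with an information gain criterion but gives no details. The following is a minimal sketch of one common formulation for a binary category, assuming each document is represented as a set of terms and each label is 1 (in the category) or 0 (not in it). The function names `entropy`, `information_gain`, and `select_features` are illustrative and are not taken from the paper, and the authors' exact scoring and preprocessing may differ.

```python
import math


def entropy(pos, neg):
    """Binary entropy (in bits) of a group with `pos` positive and `neg` negative documents."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(docs, labels, term):
    """IG(term) = H(C) - [P(t) H(C | t) + P(not t) H(C | not t)].

    Assumes `docs` is a non-empty list of term sets and `labels` holds 0/1 values.
    """
    n = len(docs)
    pos = sum(1 for y in labels if y)
    neg = n - pos
    # Class counts among documents that contain the term.
    with_pos = sum(1 for d, y in zip(docs, labels) if term in d and y)
    with_neg = sum(1 for d, y in zip(docs, labels) if term in d and not y)
    # Class counts among documents that do not contain the term.
    without_pos = pos - with_pos
    without_neg = neg - with_neg
    n_with = with_pos + with_neg
    n_without = n - n_with
    conditional = (n_with / n) * entropy(with_pos, with_neg) \
        + (n_without / n) * entropy(without_pos, without_neg)
    return entropy(pos, neg) - conditional


def select_features(docs, labels, k):
    """Return the k terms with the highest information gain (ties keep alphabetical order)."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t), reverse=True)
    return ranked[:k]
```

With a toy corpus in which one term appears only in positive documents and another only in negative ones, both terms receive the maximum score, while a term spread evenly across the two classes scores zero. Varying `k` corresponds to experimenting with different numbers of features, as the abstract describes.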

Similar resources

A Fuzzy Semi-Supervised Support Vector Machines Approach to Hypertext Categorization

Hypertext/text domains are characterized by several tens or hundreds of thousands of features. This represents a challenge for supervised learning algorithms which have to learn accurate classifiers using a small set of available training examples. In this paper, a fuzzy semi-supervised support vector machines (FSS-SVM) algorithm is proposed. It tries to overcome the need for a large labelled t...

Document Clustering Based On Semi-Supervised Term Clustering

This study proposes a multi-step feature (term) selection process and, in a semi-supervised fashion, provides initial centers for term clusters. It then uses the fuzzy c-means (FCM) clustering algorithm to cluster terms. Finally, each document is assigned to its closest associated term clusters. While most text clustering algorithms directly use documents for clustering, we propose to f...

A fuzzy semi-supervised support vector machine approach to hypertext categorization

Hypertext/text domains are characterized by several tens or hundreds of thousands of features. This represents a challenge for supervised learning algorithms which have to learn accurate classifiers using a small set of available training examples. In this paper, a fuzzy semi-supervised support vector machines (FSS-SVM) algorithm is proposed. It tries to overcome the need for a large labelled t...

Semi-supervised Text Categorization Using Recursive K-means Clustering

In this paper, we present a semi-supervised learning algorithm for classification of text documents. A method of labeling unlabeled text documents is presented. The presented method is based on the principle of divide and conquer strategy. It uses recursive K-means algorithm for partitioning both labeled and unlabeled data collection. The K-means algorithm is applied recursively on each partiti...

Improve Semi-Supervised Fuzzy C-means Clustering Based On Feature Weighting

Semi-supervised learning lies somewhere between unsupervised and supervised learning. In fact, most semi-supervised learning strategies are based on extending either unsupervised or supervised learning to include additional information typical of the other learning paradigm. Constraint fuzzy c-means is a novel semi-supervised fuzzy c-means algorithm proposed by Li et al. [1]. Constraint FCM like FCM ...

Publication date: 2004